This paper introduces TALCS, a new corpus for Mandarin-English code-switching speech recognition, suitable for training and evaluating code-switching speech recognition systems. The TALCS corpus is derived from real online one-on-one English teaching scenarios at TAL Education Group and contains roughly 587 hours of speech sampled at 16 kHz. To the best of our knowledge, TALCS is the largest well-labeled open-source Mandarin-English code-switching automatic speech recognition (ASR) dataset in the world. In this paper, we describe the recording procedure in detail, including the audio capture devices and corpus environments. The TALCS corpus is freely available for download under a permissive license. Using the corpus, we conduct ASR experiments in two popular speech recognition toolkits, ESPnet and WeNet, to build baseline systems, and compare their mix error rate (MER) performance on TALCS. Experimental results indicate that the quality of the audio recordings and transcriptions is promising and that the baseline systems are workable.
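As a rough illustration of the metric reported above, the sketch below computes a mix error rate over code-switched transcripts, assuming Mandarin is scored at the character level and English at the word level; the tokenization and function names are illustrative, not the official scoring script.

```python
import re

def tokenize_cs(text: str) -> list[str]:
    # Split a code-switched string into Mandarin characters and English words,
    # so each contributes one token to the error-rate denominator (assumed scheme).
    return re.findall(r"[\u4e00-\u9fff]|[A-Za-z']+", text)

def edit_distance(ref: list[str], hyp: list[str]) -> int:
    # Standard Levenshtein distance via one-row dynamic programming.
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                              # deletion
                        dp[j - 1] + 1,                          # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))      # substitution
            prev = cur
    return dp[-1]

def mix_error_rate(refs: list[str], hyps: list[str]) -> float:
    # MER = total edit distance / total reference tokens over the test set.
    errs = toks = 0
    for ref, hyp in zip(refs, hyps):
        r, h = tokenize_cs(ref), tokenize_cs(hyp)
        errs += edit_distance(r, h)
        toks += len(r)
    return errs / toks

print(mix_error_rate(["我想 learn English"], ["我 想 学 learn English"]))  # 0.25
```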
Prompt-tuning is a new paradigm for fine-tuning pre-trained language models in a parameter-efficient way. Here, we explore the use of HyperNetworks to generate hyper-prompts: we propose HyperPrompt, a novel architecture for prompt-based task-conditioning of self-attention in Transformers. The hyper-prompts are end-to-end learnable via generation by a HyperNetwork. HyperPrompt allows the network to learn task-specific feature maps where the hyper-prompts serve as task global memories for the queries to attend to, while enabling flexible information sharing among tasks. We show that HyperPrompt is competitive against strong multi-task learning baselines with as few as 0.14% additional task-conditioning parameters, achieving great parameter and computational efficiency. Through extensive empirical experiments, we demonstrate that HyperPrompt can achieve superior performance over strong T5 multi-task learning baselines and parameter-efficient adapter variants, including Prompt-Tuning and HyperFormer++, on the GLUE and SuperGLUE natural language understanding benchmarks across many model sizes.
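To make the mechanism concrete, here is a minimal sketch of hypernetwork-generated prompts injected into self-attention, assuming the hyper-prompts are prepended to the keys and values so that queries can attend to them as task-global memory; the dimensions and hypernetwork shape are illustrative rather than the exact HyperPrompt configuration.

```python
import torch
import torch.nn as nn

class HyperPromptAttention(nn.Module):
    # Sketch: a small hypernetwork maps a task embedding to per-layer prompt
    # key/value vectors, which are concatenated to the regular K/V so the
    # queries can attend to them as task-global memory. All sizes illustrative.
    def __init__(self, d_model=512, n_heads=8, n_tasks=8, prompt_len=6, d_task=64):
        super().__init__()
        self.task_emb = nn.Embedding(n_tasks, d_task)
        # Hypernetwork: task embedding -> flattened prompt keys and values.
        self.hyper = nn.Sequential(
            nn.Linear(d_task, d_task), nn.ReLU(),
            nn.Linear(d_task, 2 * prompt_len * d_model),
        )
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.prompt_len, self.d_model = prompt_len, d_model

    def forward(self, x, task_id):
        b = x.size(0)
        pkv = self.hyper(self.task_emb(task_id))                  # (b, 2*L*d)
        pk, pv = pkv.view(b, 2, self.prompt_len, self.d_model).unbind(1)
        k = torch.cat([pk, x], dim=1)                             # prepend prompt keys
        v = torch.cat([pv, x], dim=1)                             # prepend prompt values
        out, _ = self.attn(x, k, v)                               # queries stay x
        return out

layer = HyperPromptAttention()
x = torch.randn(2, 10, 512)
task = torch.tensor([0, 3])
print(layer(x, task).shape)  # torch.Size([2, 10, 512])
```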
Spatiotemporal forecasting has various applications in the neuroscience, climate, and transportation domains. Traffic forecasting is one canonical example of such learning task. The task is challenging due to (1) complex spatial dependency on road networks, (2) non-linear temporal dynamics with changing road conditions and (3) inherent difficulty of long-term forecasting. To address these challenges, we propose to model the traffic flow as a diffusion process on a directed graph and introduce Diffusion Convolutional Recurrent Neural Network (DCRNN), a deep learning framework for traffic forecasting that incorporates both spatial and temporal dependency in the traffic flow. Specifically, DCRNN captures the spatial dependency using bidirectional random walks on the graph, and the temporal dependency using the encoder-decoder architecture with scheduled sampling. We evaluate the framework on two real-world large scale road network traffic datasets and observe consistent improvement of 12%-15% over state-of-the-art baselines.
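A minimal sketch of the diffusion convolution at the heart of DCRNN, assuming one scalar weight per diffusion step and direction; the full model embeds this operation inside recurrent (GRU-style) gates, which are omitted here.

```python
import numpy as np

def diffusion_conv(X, W, theta_fwd, theta_bwd, K):
    """K-step bidirectional random-walk (diffusion) convolution on a
    directed graph. theta_fwd/theta_bwd hold one learned weight per
    diffusion step and direction; shapes are illustrative."""
    # Forward and backward transition matrices (out-/in-degree normalized).
    P_fwd = W / W.sum(axis=1, keepdims=True)
    P_bwd = W.T / W.T.sum(axis=1, keepdims=True)
    out = np.zeros_like(X)
    Xf, Xb = X.copy(), X.copy()
    for k in range(K):
        out += theta_fwd[k] * Xf + theta_bwd[k] * Xb
        Xf, Xb = P_fwd @ Xf, P_bwd @ Xb   # take one more random-walk step
    return out

rng = np.random.default_rng(0)
W = rng.random((5, 5))            # weighted adjacency of a directed road graph
X = rng.random((5, 2))            # one feature vector per sensor/node
print(diffusion_conv(X, W, theta_fwd=[0.5, 0.3], theta_bwd=[0.5, 0.3], K=2).shape)
```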
Packaged fresh-cut lettuce is widely consumed as a major component of vegetable salads owing to its high nutrition, freshness, and convenience. However, enzymatic browning discoloration at the lettuce cut edges significantly reduces product quality and shelf life. While many research and breeding efforts are underway to minimize browning, progress is hindered by the lack of a rapid and reliable methodology to evaluate browning. Current methods for identifying and quantifying browning are too subjective, labor intensive, or inaccurate. In this paper, we report a deep learning model for lettuce browning prediction. To the best of our knowledge, this is the first such deep learning work, using a pretrained Siamese Quadratic Swin (SQ-Swin) transformer with several highlights. First, our model includes quadratic features in the transformer, which are more powerful at incorporating real-world representations than linear features. Second, a multi-scale training strategy is proposed to augment the data and explore more of the inherent self-similarity of the lettuce images. Third, the proposed model uses a Siamese architecture, which learns the inter-relations among the limited training samples. Fourth, the model is pretrained on ImageNet and then trained with the Reptile meta-learning algorithm to learn higher-order gradients rather than ordinary ones. Experimental results on fresh-cut lettuce datasets show that the proposed SQ-Swin outperforms traditional methods and other deep learning based backbones.
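Of the four highlights, the Reptile meta-update is the simplest to sketch. The following is a generic Reptile step (Nichol et al., 2018), not the paper's exact training loop; the hyperparameters and loader interface are illustrative.

```python
import copy
import torch

def reptile_step(model, task_loader, loss_fn, inner_steps=5,
                 inner_lr=1e-3, meta_lr=0.1):
    """One Reptile meta-update: run a few SGD steps on one task, then move
    the initial weights toward the adapted weights. Assumes task_loader
    yields at least inner_steps (x, y) batches."""
    inner = copy.deepcopy(model)
    opt = torch.optim.SGD(inner.parameters(), lr=inner_lr)
    it = iter(task_loader)
    for _ in range(inner_steps):
        x, y = next(it)
        opt.zero_grad()
        loss_fn(inner(x), y).backward()
        opt.step()
    # theta <- theta + meta_lr * (phi - theta)
    with torch.no_grad():
        for p, q in zip(model.parameters(), inner.parameters()):
            p.add_(meta_lr * (q - p))
```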
In this paper, we present a novel two-pass approach to unify streaming and non-streaming end-to-end (E2E) speech recognition in a single model. Our model adopts the hybrid CTC/attention architecture, in which the conformer layers in the encoder are modified. We propose a dynamic chunk-based attention strategy to allow arbitrary right-context lengths. At inference time, the CTC decoder generates n-best hypotheses in a streaming way, and the inference latency can be easily controlled by simply changing the chunk size. The CTC hypotheses are then rescored by the attention decoder to obtain the final result. This efficient rescoring process introduces very little sentence-level latency. Our experiments on the open 170-hour AISHELL-1 dataset show that the proposed method can simply and effectively unify the streaming and non-streaming models. On the AISHELL-1 test set, our unified model achieves a 5.60% relative character error rate (CER) reduction in non-streaming ASR compared with a standard non-streaming transformer. The same model achieves 5.42% CER with 640 ms latency in the streaming ASR system.
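A minimal sketch of the chunk-based attention mask behind the dynamic-chunk strategy, assuming each frame may attend up to the end of its own chunk; training would sample the chunk size at random, and full (non-streaming) context is the special case where one chunk covers the whole utterance.

```python
import torch

def chunk_attention_mask(n_frames: int, chunk_size: int) -> torch.Tensor:
    """mask[i, j] is True where frame i may attend to frame j: everything
    up to the end of frame i's chunk, so right context is bounded by the
    chunk size. Left context is unlimited in this simplified version."""
    idx = torch.arange(n_frames)
    chunk_end = (idx // chunk_size + 1) * chunk_size   # last visible frame + 1
    return idx.unsqueeze(0) < chunk_end.unsqueeze(1)

print(chunk_attention_mask(6, 2).int())
# tensor([[1, 1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0, 0],
#         [1, 1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 1, 1],
#         [1, 1, 1, 1, 1, 1]])
```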
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes image and point-cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple while its performance is impressive. CMT obtains 73.0% NDS on the nuScenes benchmark. Moreover, CMT remains strongly robust even if the LiDAR is missing. Code will be released at https://github.com/junjie18/CMT.
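A hedged sketch of the implicit-alignment idea: instead of an explicit view transformation, tokens from both modalities carry positional embeddings computed from 3D coordinates by a small MLP, placing them in a shared space. The ray sampling and MLP shapes below are simplified assumptions, not CMT's exact design.

```python
import torch
import torch.nn as nn

class Coords3DEmbedding(nn.Module):
    # Sketch: each token (image patch or point-cloud voxel) gets a positional
    # embedding from a set of sampled 3D points, so tokens from both
    # modalities live in one coordinate-aware space. Shapes illustrative.
    def __init__(self, d_model=256, n_depth=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * n_depth, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, pts):           # pts: (tokens, n_depth, 3) sampled 3D points
        return self.mlp(pts.flatten(1))

emb = Coords3DEmbedding()
image_token_pts = torch.randn(100, 4, 3)   # e.g., points sampled along pixel rays
lidar_token_pts = torch.randn(500, 4, 3)   # e.g., voxel centers, repeated
tokens_pos = torch.cat([emb(image_token_pts), emb(lidar_token_pts)])
print(tokens_pos.shape)  # torch.Size([600, 256])
```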
Dataset distillation has emerged as a prominent technique to improve data efficiency when training machine learning models. It encapsulates the knowledge from a large dataset into a smaller synthetic dataset. A model trained on this smaller distilled dataset can attain comparable performance to a model trained on the original training dataset. However, the existing dataset distillation techniques mainly aim at achieving the best trade-off between resource usage efficiency and model utility. The security risks stemming from them have not been explored. This study performs the first backdoor attack against the models trained on the data distilled by dataset distillation models in the image domain. Concretely, we inject triggers into the synthetic data during the distillation procedure rather than during the model training stage, where all previous attacks are performed. We propose two types of backdoor attacks, namely NAIVEATTACK and DOORPING. NAIVEATTACK simply adds triggers to the raw data at the initial distillation phase, while DOORPING iteratively updates the triggers during the entire distillation procedure. We conduct extensive evaluations on multiple datasets, architectures, and dataset distillation techniques. Empirical evaluation shows that NAIVEATTACK achieves decent attack success rate (ASR) scores in some cases, while DOORPING reaches higher ASR scores (close to 1.0) in all cases. Furthermore, we conduct a comprehensive ablation study to analyze the factors that may affect the attack performance. Finally, we evaluate multiple defense mechanisms against our backdoor attacks and show that our attacks can practically circumvent these defense mechanisms.
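A toy sketch of the NAIVEATTACK idea, assuming a fixed bottom-right patch stamped on raw images before distillation; the patch shape, location, and poisoning details are illustrative, and DOORPING would instead optimize the trigger throughout the distillation procedure.

```python
import numpy as np

def add_trigger(images, patch_size=3, value=1.0):
    """Stamp a fixed trigger patch onto raw images *before* distillation,
    so the backdoor is carried into the synthetic dataset. The patch and
    its placement here are illustrative assumptions."""
    poisoned = images.copy()
    poisoned[:, -patch_size:, -patch_size:] = value   # bottom-right white patch
    return poisoned

clean = np.random.rand(8, 32, 32).astype(np.float32)  # toy grayscale batch
backdoored = add_trigger(clean)
print(backdoored[0, -3:, -3:])   # the stamped trigger region, all 1.0
```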
Automatic music generation with artificial intelligence typically requires a large amount of data which is hard to obtain for many less common genres and musical instruments. To tackle this issue, we present ongoing work and preliminary findings on the possibility for deep models to transfer knowledge from language to music, by finetuning large language models pre-trained on a massive text corpus on only hundreds of MIDI files of drum performances. We show that by doing so, one of the largest, state-of-the-art models (GPT3) is capable of generating reasonable drum grooves, while models that are not pre-trained (Transformer) show no such ability beyond naive repetition. Evaluating generated music is a challenging task, and even more so evaluating drum grooves, which have little precedent in the literature. Hence, we propose a tailored structural evaluation method and analyze drum grooves produced by GPT3 compared to those played by human professionals, exposing the strengths and weaknesses of such generation by language-to-music transfer. Our findings suggest that language-to-music transfer learning with large language models is viable and promising.
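As an illustration of how drum performances can be fed to a text-pretrained model, the sketch below serializes MIDI drum hits into compact text tokens; the exact vocabulary used in the paper may differ.

```python
def drums_to_text(events):
    """Serialize MIDI drum hits as text tokens of (time step, drum pitch,
    velocity bucket) so a text-pretrained language model can be fine-tuned
    on them. This token format is an illustrative assumption."""
    return " ".join(f"t{t} p{pitch} v{vel // 32}" for t, pitch, vel in events)

# (16th-note step, General MIDI drum pitch, velocity): kick, hat, snare, hat
groove = [(0, 36, 110), (2, 42, 70), (4, 38, 100), (6, 42, 70)]
print(drums_to_text(groove))   # t0 p36 v3 t2 p42 v2 t4 p38 v3 t6 p42 v2
```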
Few Shot Instance Segmentation (FSIS) requires models to detect and segment novel classes from only a few support examples. In this work, we explore a simple yet unified solution for FSIS as well as its incremental variants, and introduce a new framework named Reference Twice (RefT) to fully explore the relationship between support/query features based on a Transformer-like framework. Our key insights are twofold: Firstly, with the aid of support masks, we can generate dynamic class centers more appropriately to re-weight query features. Secondly, we find that support object queries have already encoded key factors after base training. In this way, the query features can be enhanced twice from two aspects, i.e., feature-level and instance-level. In particular, we first design a mask-based dynamic weighting module to enhance support features and then propose to link object queries for better calibration via cross-attention. After the above steps, the novel classes can be improved significantly over our strong baseline. Additionally, our new framework can be easily extended to incremental FSIS with minor modification. When benchmarking results on the COCO dataset for FSIS, gFSIS, and iFSIS settings, our method achieves a competitive performance compared to existing approaches across different shots, e.g., we boost nAP by a noticeable +8.2/+9.4 over the current state-of-the-art FSIS method for 10/30-shot. We further demonstrate the superiority of our approach on Few Shot Object Detection. Code and model will be available.
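A simplified sketch of the first "reference" step, assuming the support mask pools support features into a dynamic class center that channel-wise re-weights the query features; the sigmoid gating and shapes are illustrative simplifications of the actual module.

```python
import torch

def reweight_query_features(support_feat, support_mask, query_feat):
    """Pool support features under the support mask into a dynamic class
    center, then use it as a per-channel gate on the query features.
    support_feat: (C, H, W); support_mask: (H, W) bool; query_feat: (C, H, W)."""
    m = support_mask.float()
    center = (support_feat * m).sum(dim=(1, 2)) / m.sum().clamp(min=1)  # (C,)
    gate = torch.sigmoid(center).view(-1, 1, 1)                          # channel weights
    return query_feat * gate

sup = torch.randn(256, 32, 32)
mask = torch.rand(32, 32) > 0.7
qry = torch.randn(256, 32, 32)
print(reweight_query_features(sup, mask, qry).shape)  # torch.Size([256, 32, 32])
```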
Graph Neural Networks (GNNs) have shown satisfying performance on various graph learning tasks. To achieve better fitting capability, most GNNs have a large number of parameters, which makes them computationally expensive. Therefore, it is difficult to deploy them onto edge devices with scarce computational resources, e.g., mobile phones and wearable smart devices. Knowledge Distillation (KD) is a common solution to compress GNNs, where a lightweight model (i.e., the student model) is encouraged to mimic the behavior of a computationally expensive GNN (i.e., the teacher GNN model). Nevertheless, most existing GNN-based KD methods lack fairness consideration. As a consequence, the student model usually inherits and even exaggerates the bias from the teacher GNN. To handle such a problem, we take initial steps towards fair knowledge distillation for GNNs. Specifically, we first formulate a novel problem of fair knowledge distillation for GNN-based teacher-student frameworks. Then we propose a principled framework named RELIANT to mitigate the bias exhibited by the student model. Notably, the design of RELIANT is decoupled from any specific teacher and student model structures, and thus can be easily adapted to various GNN-based KD frameworks. We perform extensive experiments on multiple real-world datasets, which corroborate that RELIANT achieves less biased GNN knowledge distillation while maintaining high prediction utility.
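For context, here is a sketch of the standard distillation objective such GNN KD frameworks build on: a temperature-softened KL term that makes the student mimic the teacher, mixed with the ordinary supervised loss. RELIANT's fairness-aware components are not reproduced here.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Classic knowledge-distillation loss (Hinton et al.): KL between
    temperature-softened student and teacher distributions, scaled by T^2,
    plus the hard-label cross-entropy. Hyperparameters are illustrative."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s = torch.randn(16, 7)           # student logits for 16 nodes, 7 classes
t = torch.randn(16, 7)           # teacher GNN logits for the same nodes
y = torch.randint(0, 7, (16,))
print(kd_loss(s, t, y))
```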